fix(plan-reviews): restore RECOMMENDATION + Completeness split + Codex ELI10 (v1.6.3.0) #1149

Merged
garrytan merged 8 commits into main from garrytan/plan-review-regressions
Apr 23, 2026

@garrytan garrytan commented Apr 22, 2026

Summary

Two-part fix for AskUserQuestion format regressions in /plan-ceo-review and /plan-eng-review, measured on both Claude Opus 4.7 and Codex (GPT-5.4).

v1.6.2.0 — Claude regression. A user on Opus 4.7 reported that /plan-ceo-review and /plan-eng-review had stopped showing the "RECOMMENDATION: Choose X" line and the per-option "Completeness: N/10" score. Investigation showed the real failure mode: on kind-differentiated questions (mode selection, architectural A-vs-B, cherry-pick Add/Defer/Skip), Opus 4.7 was either fabricating filler scores (10/10 on every option, which conveys nothing) or dropping the format when the metric didn't fit. The fix applies Completeness: N/10 by question type: coverage-differentiated options get scores, while kind-differentiated options instead get the note "Note: options differ in kind, not coverage — no completeness score."

v1.6.3.0 — Codex follow-up. A user reported that Codex (GPT-5.4) was failing the same pattern 10/10 times: it skipped the ELI10 explanation and the RECOMMENDATION line on AskUserQuestion calls, forcing a manual "ELI10 and don't forget to recommend" re-prompt every time. Root cause: the "No preamble / Prefer doing over listing" rule in the gpt.md model overlay was steering Codex to skip the exact prose the user needs for decision-making. The fix adds an "AskUserQuestion is NOT preamble" carve-out to gpt.md and hardens step 2 of the AskUserQuestion Format rule (now "Simplify (ELI10, ALWAYS)", with explicit "not optional verbosity" framing).

Test Coverage

Two new periodic-tier eval files, 4 cases each, pinned to the model family under test:

Claude: test/skill-e2e-plan-format.test.ts (claude-opus-4-7):

| Case | Type | Pre-fix | Post-fix |
|---|---|---|---|
| plan-ceo-review mode selection | kind | ✗ fabricated 10/10 on all 4 modes | ✓ RECOMMENDATION + "options differ in kind" note |
| plan-ceo-review approach menu | coverage | ✗ regex missed **bolded** | ✓ RECOMMENDATION + Completeness: 5/7/10 |
| plan-eng-review coverage issue | coverage | ✓ passed | ✓ passes |
| plan-eng-review kind issue | kind | ✗ fabricated 9/9/5 on kind question | ✓ RECOMMENDATION + "options differ in kind" note |

Codex: test/codex-e2e-plan-format.test.ts (codex-cli via codex exec):

| Case | Type | Pre-fix (measured, 10/10 fail) | Post-fix (v1.6.3.0) |
|---|---|---|---|
| plan-ceo-review mode selection | kind | No ELI10, no RECOMMENDATION | ✓ ELI10 + RECOMMENDATION + "options differ in kind" |
| plan-ceo-review approach menu | coverage | Bare options list | ✓ ELI10 + RECOMMENDATION + Completeness: 5/7/10 |
| plan-eng-review coverage issue | coverage | Bare options list | ✓ ELI10 + RECOMMENDATION + Completeness |
| plan-eng-review kind issue | kind | Fabricated filler on kind | ✓ ELI10 + RECOMMENDATION + kind note |

Eval pass record

| Pass | Result | Cost | Duration |
|---|---|---|---|
| Phase 1 baseline — Claude (pre-fix) | 1/4 assertions pass (evidence) | $2.19 | 332s |
| Phase 3 post-fix — Claude | 4/4 pass | $1.84 | 274s |
| Phase 3b regression sweep — skill-e2e-plan.test.ts | 12/12 pass, no drift | $5.19 | 1484s |
| Codex eval (v1.6.3.0 fix applied) | 4/4 pass | $0 (Codex billing) | 517s |

Pre-Landing Review

Three plan-phase reviews completed:

  • CEO Review (HOLD_SCOPE): 4 findings raised, 3 folded into plan, 0 critical gaps.
  • Eng Review (FULL_REVIEW): 3 issues found, all folded — completeness-section conflict resolved, phantom template anchor corrected, cross-skill regression sweep added.
  • DX Review (TRIAGE): score 6/10 → 8/10. Critical finding folded in: don't fabricate Completeness: X/10 on kind-differentiated questions.

Plan Completion

All phases shipped:

  • Phase 1 (baseline eval) — landed, captured regression evidence.
  • Phase 2 (preamble + template fix) — resolver split, both preamble locations synchronized, 3 template anchors.
  • Phase 3 (re-run eval) — 4/4 pass on Claude.
  • Phase 3b (regression sweep) — 12/12 pass on direct neighbor.
  • Follow-up scope (Codex) — gpt.md carve-out, 4 new Codex eval cases, 4/4 pass.

Phase 4 (literal in-template scaffolding fallback) not needed.

Verification Results

  • bun test — 448+ passing, 0 failing after golden fixture refresh.
  • gen-skill-docs --host all — clean across all hosts (claude, codex, factory, gbrain, gpt-5.4, hermes, kiro, opencode, openclaw, slate, cursor).
  • Claude eval: 4/4 pass on Opus 4.7.
  • Claude regression sweep: 12/12 pass on skill-e2e-plan.test.ts.
  • Codex eval: 4/4 pass on GPT-5.4 via codex exec.

Test plan

  • All free tests pass (bun test — 448+ tests, host-config goldens refreshed)
  • Phase 1 baseline eval captured Claude regression (3/4 format assertions fail pre-fix)
  • Phase 3 post-fix Claude eval: 4/4 pass
  • Phase 3b regression sweep: 12/12 pass (skill-e2e-plan.test.ts, ~$5 spend, no drift)
  • Codex eval: 4/4 pass (ELI10 + RECOMMENDATION + correct coverage-vs-kind)
  • All T2 skills regenerated consistently across all hosts
  • Golden fixtures refreshed (claude-ship, codex-ship, factory-ship)

🤖 Generated with Claude Code

garrytan and others added 4 commits April 22, 2026 01:10
Four-case periodic-tier eval that captures the verbatim AskUserQuestion
text /plan-ceo-review and /plan-eng-review produce, then asserts the
format rule is honored: RECOMMENDATION always, Completeness: N/10 only
on coverage-differentiated options, and an explicit "options differ in
kind" note on kind-differentiated options.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Classified periodic because behavior depends on Opus non-determinism —
gate-tier would flake and block merges.

Test harness instructs the agent to write its would-be AskUserQuestion
text to $OUT_FILE rather than invoke a real tool (MCP AskUserQuestion
isn't wired in the test subprocess). Regex predicates then validate
the captured content.
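
The regex predicates described above might look roughly like the following. This is a minimal sketch, not the code in test/skill-e2e-plan-format.test.ts: the names `checkFormat` and `stripEmphasis` are hypothetical, and the exact patterns are assumptions. Markdown emphasis is stripped before matching so that a bolded label cannot slip past the regex.

```typescript
// Hypothetical sketch of the format predicates; names and patterns are
// assumptions, not the real test file's implementation.
type QuestionType = "coverage" | "kind";

interface FormatCheck {
  hasRecommendation: boolean;
  hasCompletenessScore: boolean;
  hasKindNote: boolean;
  ok: boolean;
}

// Remove markdown emphasis characters so "**RECOMMENDATION:**" matches
// the same way as the plain-text form.
function stripEmphasis(text: string): string {
  return text.replace(/[*_`]/g, "");
}

function checkFormat(captured: string, type: QuestionType): FormatCheck {
  const text = stripEmphasis(captured);
  const hasRecommendation = /RECOMMENDATION:\s*Choose\s+\S+/i.test(text);
  const hasCompletenessScore = /Completeness:\s*\d{1,2}\/10/i.test(text);
  const hasKindNote = /options differ in kind/i.test(text);

  // Coverage questions need a score; kind questions need the note and
  // must not carry a fabricated score.
  const ok =
    type === "coverage"
      ? hasRecommendation && hasCompletenessScore
      : hasRecommendation && hasKindNote && !hasCompletenessScore;

  return { hasRecommendation, hasCompletenessScore, hasKindNote, ok };
}
```

Returning the individual flags alongside `ok` makes a failing case easier to diagnose than a single boolean would.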

Cost: ~$2 per full run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…stion type

Opus 4.7 users reported /plan-ceo-review and /plan-eng-review stopped
emitting the RECOMMENDATION line and per-option Completeness: X/10
scores. E2E capture showed the real failure mode: on kind-differentiated
questions (mode selection, architectural A-vs-B, cherry-pick), Opus 4.7
either fabricated filler scores (10/10 on every option — conveys nothing)
or dropped the format entirely when the metric didn't fit.

Fix is at two layers:

1. scripts/resolvers/preamble/generate-ask-user-format.ts splits the old
   run-on step 3 into:
   - Step 3 "Recommend (ALWAYS)": RECOMMENDATION is required on every
     question, coverage- or kind-differentiated.
   - Step 4 "Score completeness (when meaningful)": emit Completeness: N/10
     only when options differ in coverage. When options differ in kind,
     skip the score and include a one-line explanatory note. Do not
     fabricate scores.

2. scripts/resolvers/preamble/generate-completeness-section.ts updates
   the Completeness Principle tail to match. Without this, the preamble
   contained two rules (one conditional, one unconditional) and the
   model hedged.
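
As an illustration of the split rule, the sketch below renders the decision lines for a question. It is not the resolver's code (the resolver emits prompt text for the model, not question bodies); `renderDecisionLines`, the `Question` shape, and the line layout are all hypothetical. Only the RECOMMENDATION label, the Completeness: N/10 label, and the kind note text come from this PR.

```typescript
// Hypothetical illustration of steps 3 and 4; types and layout are assumptions.
interface Option {
  label: string;
  summary: string;
  completeness?: number; // only meaningful for coverage-differentiated options
}

interface Question {
  differsBy: "coverage" | "kind";
  recommended: string;
  options: Option[];
}

const KIND_NOTE =
  "Note: options differ in kind, not coverage — no completeness score.";

function renderDecisionLines(q: Question): string[] {
  // Step 3: the recommendation is emitted for every question type.
  const lines: string[] = [`RECOMMENDATION: Choose ${q.recommended}`];

  // Step 4: scores only when options differ in coverage; otherwise the note.
  if (q.differsBy === "kind") {
    lines.push(KIND_NOTE);
  }
  for (const opt of q.options) {
    if (q.differsBy === "coverage") {
      if (opt.completeness === undefined) {
        // Refuse to invent a score rather than emit filler.
        throw new Error(`missing completeness for option ${opt.label}`);
      }
      lines.push(`${opt.label}) ${opt.summary} (Completeness: ${opt.completeness}/10)`);
    } else {
      lines.push(`${opt.label}) ${opt.summary}`);
    }
  }
  return lines;
}
```

Throwing on a missing score, instead of defaulting to 10/10, mirrors the "do not fabricate scores" rule.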

Template anchors reinforce the distinction where agent judgment is most
likely to drift:

- plan-ceo-review Section 0C-bis (approach menu) gets the
  coverage-differentiated anchor.
- plan-ceo-review Section 0F (mode selection) gets the kind-differentiated
  anchor.
- plan-eng-review CRITICAL RULE section gets the coverage-vs-kind rule
  for every per-issue AskUserQuestion raised during the review.

Regenerated SKILL.md for all T2 skills + golden fixtures refreshed. Every
skill using the T2 preamble now has the same conditional scoring rule.

Verified via new periodic-tier eval (test/skill-e2e-plan-format.test.ts):
3 of 4 cases fail on prior behavior, all 4 pass with this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions Bot commented Apr 22, 2026

E2E Evals: ✅ PASS

68/68 tests passed | $8.74 total cost | 12 parallel runners

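| Suite | Result | Status | Cost |
|---|---|---|---|
| e2e-browse | 7/7 | ✅ | $0.33 |
| e2e-deploy | 6/6 | ✅ | $1.35 |
| e2e-design | 3/3 | ✅ | $0.48 |
| e2e-plan | 8/8 | ✅ | $1.59 |
| e2e-qa-workflow | 3/3 | ✅ | $1.30 |
| e2e-review | 6/6 | ✅ | $1.32 |
| e2e-workflow | 4/4 | ✅ | $0.52 |
| llm-judge | 25/25 | ✅ | $0.50 |
| e2e-deploy | 6/6 | ✅ | $1.35 |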

12x ubicloud-standard-2 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

garrytan and others added 4 commits April 22, 2026 21:34
Four-case periodic-tier eval mirrors test/skill-e2e-plan-format.test.ts
but drives the plan review skills via codex exec instead of claude -p.

Context: Codex under the gpt.md "No preamble / Prefer doing over listing"
overlay tends to skip the Simplify/ELI10 paragraph and the RECOMMENDATION
line on AskUserQuestion calls. Users have to manually re-prompt "ELI10
and don't forget to recommend" almost every time. This test pins the
behavior so regressions surface.

Cases:
- plan-ceo-review mode selection (kind-differentiated)
- plan-ceo-review approach menu (coverage-differentiated)
- plan-eng-review per-issue coverage decision
- plan-eng-review per-issue architectural choice (kind-differentiated)

Assertions on captured AskUserQuestion text:
- RECOMMENDATION: Choose present (all cases)
- Completeness: N/10 present on coverage, absent on kind
- "options differ in kind" note present on kind
- ELI10 length floor (>400 chars) — catches bare options-only output
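
The length-floor assertion could be sketched as below. The 400-character threshold comes from the list above; measuring it over the whole captured text after collapsing whitespace is an assumption, and `passesLengthFloor` is a hypothetical name.

```typescript
// Hypothetical sketch of the ELI10 length floor. Assumption: the floor is
// applied to the entire captured text, with whitespace runs collapsed so
// padding cannot satisfy it.
const ELI10_MIN_CHARS = 400;

function passesLengthFloor(
  captured: string,
  minChars: number = ELI10_MIN_CHARS,
): boolean {
  const normalized = captured.replace(/\s+/g, " ").trim();
  return normalized.length > minChars;
}
```

A bare options list such as "A) Expand / B) Hold / C) Reduce" falls far below the floor, which is the failure this check is meant to catch.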

Cost: ~$2-4 per full run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to v1.6.2.0. Codex (GPT-5.4) under the gpt.md overlay
treated "No preamble / Prefer doing over listing" as license to skip
the Simplify paragraph and the RECOMMENDATION line on AskUserQuestion
calls. Users had to manually re-prompt "ELI10 and don't forget to
recommend" almost every time.

Two layers:

1. model-overlays/gpt.md — adds an explicit "AskUserQuestion is NOT
   preamble" carve-out. The "No preamble" rule applies to direct
   answers; AskUserQuestion content must emit the full format
   (Re-ground, Simplify/ELI10, Recommend, Options). Tells the model:
   if you find yourself about to skip any of these, back up and emit
   them — the user will ask anyway, so do it the first time.

2. scripts/resolvers/preamble/generate-ask-user-format.ts — step 2
   renamed to "Simplify (ELI10, ALWAYS)" with explicit "not optional
   verbosity, not preamble" framing. Step 3 "Recommend (ALWAYS)"
   hardened: "Never omit, never collapse into the options list."

All T2 skills regenerated across all hosts. Golden fixtures refreshed
(claude-ship, codex-ship, factory-ship). Updated the ELI10 assertion
in test/gen-skill-docs.test.ts to match the new wording.

Codex compliance to be verified empirically via test/codex-e2e-plan-format.test.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two test infrastructure bugs in the initial Codex eval landed in the
prior commit:

1. sandbox: 'read-only' (the default) blocked Codex from writing
   $OUT_FILE. Test reported "STATUS: BLOCKED" and exited 0 without
   a capture file. Fixed: sandbox: 'workspace-write' for all 4 cases,
   allowing writes inside the tempdir.

2. recordCodexResult called a non-existent evalCollector.record()
   API (I invented it). The real surface is addTest() with a
   different field schema. Aligned with test/codex-e2e.test.ts
   pattern.
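
The first bug is a reminder that a zero exit code alone does not prove a capture happened. A hedged sketch of a guard that treats "exited 0 but wrote nothing" as its own failure state follows; `classifyRun`, `RunResult`, and the status names are hypothetical and are not part of the actual harness.

```typescript
// Hypothetical guard; the type and status names are assumptions.
type RunStatus = "captured" | "blocked" | "crashed";

interface RunResult {
  exitCode: number;
  capturedText: string | null; // null when no capture file was written
}

function classifyRun(result: RunResult): RunStatus {
  if (result.exitCode !== 0) {
    return "crashed";
  }
  // Exit 0 with a missing or empty capture means the agent could not write
  // its output (for example, under a read-only sandbox).
  if (result.capturedText === null || result.capturedText.trim() === "") {
    return "blocked";
  }
  return "captured";
}
```

Failing the test on anything other than "captured" would have surfaced the sandbox problem immediately instead of letting the case exit quietly.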

With both fixed, the eval now actually measures Codex AskUserQuestion
format compliance. All 4 cases pass with the gpt.md carve-out applied
on top of v1.6.2.0: RECOMMENDATION always, Completeness: N/10 only on
coverage, "options differ in kind" note on kind, ELI10 explanation
present.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the Codex ELI10 + RECOMMENDATION carve-out scope landed after
v1.6.2.0's Claude-verified fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan changed the title from "fix(plan-reviews): restore RECOMMENDATION + split Completeness by question type (v1.6.2.0)" to "fix(plan-reviews): restore RECOMMENDATION + Completeness split + Codex ELI10 (v1.6.3.0)" on Apr 23, 2026
@garrytan garrytan merged commit 69733e2 into main Apr 23, 2026
20 checks passed
garrytan added a commit that referenced this pull request Apr 23, 2026
….6.4.0

Main shipped v1.6.3.0 (Codex ELI10 + RECOMMENDATION fix, #1149) and also took the v1.6.2.0 version slot (plan-reviews RECOMMENDATION + Completeness split) while this branch was at 1.6.2.0 without a CHANGELOG entry. Version-number collision resolved per CLAUDE.md: branch bumps above main's latest, accepts main's two new CHANGELOG entries.

VERSION: 1.6.4.0 (above main's 1.6.3.0).
package.json: synced to 1.6.4.0.
CHANGELOG: main's v1.6.3.0 + v1.6.2.0 entries accepted, placed above our v1.5.2.0 entry in reverse-chronological order.

Auto-merged: many SKILL.md regenerations from main's preamble changes. No real conflicts in security source files.

Security test suite: 87 pass, 0 fail post-merge (security.test.ts + content-security.test.ts).
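
The collision rule used above (the branch bumps above main's latest) can be sketched as follows. This is an illustration under stated assumptions, not the procedure in CLAUDE.md: the function names are hypothetical, and bumping the third segment while resetting the fourth is inferred only from the 1.6.3.0 to 1.6.4.0 step in this commit.

```typescript
// Hypothetical sketch of "branch bumps above main's latest" for four-part
// versions. Which segment to bump is an assumption.
type Version = [number, number, number, number];

function parseVersion(v: string): Version {
  const parts = v.split(".");
  if (parts.length !== 4 || !parts.every((p) => /^\d+$/.test(p))) {
    throw new Error(`expected a four-part numeric version, got "${v}"`);
  }
  return parts.map(Number) as Version;
}

// Negative if a < b, positive if a > b, zero if equal.
function compareVersions(a: string, b: string): number {
  const x = parseVersion(a);
  const y = parseVersion(b);
  const diffs = x.map((n, i) => n - (y[i] ?? 0));
  return diffs.find((d) => d !== 0) ?? 0;
}

function bumpAboveMain(branch: string, main: string): string {
  // A branch already ahead of main keeps its version.
  if (compareVersions(branch, main) > 0) {
    return branch;
  }
  const [major, minor, patch] = parseVersion(main);
  return `${major}.${minor}.${patch + 1}.0`;
}
```

With the values from this commit, `bumpAboveMain("1.6.2.0", "1.6.3.0")` yields "1.6.4.0".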